
push down min/max/count to iceberg #6622

Merged
rdblue merged 23 commits into apache:master from huaxingao:agg_push_down2
Feb 23, 2023

Conversation

@huaxingao
Contributor

@huaxingao huaxingao commented Jan 19, 2023

This PR pushes down min/max/count to iceberg.

Combining #6252 and #6405

public void update(DataFile file) {
if (!isNull) {
R value = aggregate.eval(file);
update(value);
Contributor Author

I removed the null check and handle null inside update instead, e.g. MaxAggregate.update.

Contributor

Was this needed for correctness? I'm not sure I understand why you'd need to move it. The NullSafeAggregator was intended to avoid needing to handle null in the update methods. It also used to keep isNull set correctly so that any aggregate that is null would stop calling eval and updating.

Contributor Author

If one data file evaluates null, I think we still want to evaluate the rest of the data files. For example,

CREATE TABLE test (id LONG, data INT) USING iceberg PARTITIONED BY (id);

INSERT INTO TABLE test VALUES (1, null), (1, null), (2, 33), (2, 44), (3, 55), (3, 66);

SELECT max(data) FROM test;

For max(data), the first data file evaluates to null, but I think we still want to evaluate the rest of the data files to get the max value 66.
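A dependency-free sketch of this behavior (the class below is a simplified stand-in, not the actual Iceberg MaxAggregate): a null from one file is skipped rather than ending the aggregation.

```java
// Simplified stand-in for illustration only, not the real Iceberg class:
// a null input carries no information for MAX, so it is skipped instead of
// aborting the whole aggregation.
class MaxAggregate {
  private Integer max = null;

  void update(Integer value) {
    if (value == null) {
      return; // skip nulls; keep evaluating the remaining files
    }
    if (max == null || value > max) {
      max = value;
    }
  }

  Integer result() {
    return max;
  }
}
```

Feeding it the per-file values from the example above still yields 66, matching the expected query result.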

Contributor

I am not sure I understand the purpose of isNull in this class then. It looks like we initialize it and never change it?

Contributor

I see. In that case, I think we need to change isNull to hasValue and return a boolean from update(R).

The intent here was to signal when there is not enough information to produce a value. When there isn't, then the result value should be null, and we can skip pulling values out of rows or data files because we don't have enough information.

For example, if we are processing 3 Parquet files and 1 Avro file, the Avro file may not have a max value. Rather than giving a partial max from the 3 Parquet files, we need update(avroFile) to return hasValue = false so that we stop aggregating.

You're right that this needs to change from my original version, which assumed any null value signaled that there was no maximum. If we know that a file contains only null values, then we can skip it even if it doesn't have an upper bound. Similarly, if we get a null value from a row then we can skip it.
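A hedged sketch of that distinction (all names below are illustrative stand-ins, not the real Iceberg API): update returns whether the aggregate still has a usable value, so a file without an upper bound aborts pushdown, while an all-null file is merely skipped.

```java
// Illustrative sketch only: FileStats and MaxFromMetadata are hypothetical
// stand-ins, not Iceberg classes.
class FileStats {
  final boolean hasUpperBound; // false e.g. for a file without max stats
  final Integer upperBound;    // null when the file contains only null values

  FileStats(boolean hasUpperBound, Integer upperBound) {
    this.hasUpperBound = hasUpperBound;
    this.upperBound = upperBound;
  }
}

class MaxFromMetadata {
  private Integer max = null;
  private boolean hasValue = true;

  // Returns false once there is not enough information for a correct MAX.
  boolean update(FileStats file) {
    if (!file.hasUpperBound) {
      hasValue = false; // e.g. the Avro file: stop aggregating
    } else if (file.upperBound != null && (max == null || file.upperBound > max)) {
      max = file.upperBound; // an all-null file (upperBound == null) is skipped
    }
    return hasValue;
  }

  boolean hasValue() {
    return hasValue;
  }

  Integer result() {
    return max;
  }
}
```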

Contributor Author

I changed isNull to hasValue. I have also added a flag canPushDown in BoundAggregate to indicate if this aggregate can be pushed down. I need a way to differentiate the null: is it due to stats not being available (e.g. for a complex type) or due to the value itself being null? That's why I added this flag.

@huaxingao
Contributor Author

@rdblue Could you please take a look when you have time? I am not so sure if I added you as co-author correctly. It looks suspicious.

public static final boolean PRESERVE_DATA_GROUPING_DEFAULT = false;

// Controls whether to push down aggregate (MAX/MIN/COUNT) to Iceberg
public static final String AGGREGATE_PUSH_DOWN_ENABLED = "spark.sql.iceberg.aggregate_pushdown";
Contributor

We typically separate words with - rather than _. I think this should also match the other property. How about spark.sql.iceberg.aggregate-push-down-enabled?

Contributor

+1

Contributor Author

Fixed. Thanks

.map(agg -> SparkAggregates.convert(agg))
.collect(Collectors.toList());
aggregateEvaluator = AggregateEvaluator.create(schema, aggregates);
} catch (Exception e) {
Contributor

Can you make this exception more specific? What might be thrown here?

Contributor Author

Fixed. Thanks

.collect(Collectors.toList());
aggregateEvaluator = AggregateEvaluator.create(schema, aggregates);
} catch (Exception e) {
LOG.info("Can't push down aggregates: " + e.getMessage());
Contributor

This shouldn't swallow the exception by only printing the message. Instead, this should pass the exception to the logger so that it gets printed with the full stack trace, suppressed exceptions, and causes.
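The difference matters because getMessage() drops the stack trace and the cause chain. A small standalone illustration (the exception messages below are made up):

```java
// Standalone illustration: passing the Throwable to the logger preserves
// the stack trace and causes, while getMessage() loses them.
class LogDemo {
  static String messageOnly(Exception e) {
    return e.getMessage(); // what string concatenation in a log line keeps
  }

  static String fullTrace(Exception e) {
    // roughly what a logger prints when handed the Throwable itself
    java.io.StringWriter sw = new java.io.StringWriter();
    e.printStackTrace(new java.io.PrintWriter(sw));
    return sw.toString();
  }
}
```

With a wrapped exception, only the full trace reveals the underlying "Caused by" entry; the message alone hides it.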

Contributor Author

Fixed. Thanks


List<ManifestFile> manifests = getSnapshot().allManifests(table.io());

for (ManifestFile manifest : manifests) {
Contributor

@rdblue rdblue Feb 2, 2023

This is going to read all table metadata, which could be really large. Instead, I think this should use scan planning to get the data files. That will allow this to apply filters and skip a lot of data, and it would also parallelize manifest scanning using a ParallelIterator. You'd need to request stats, or else the tasks will be returned without them copied.

Contributor Author

Changed to use scan planning to get the data files. Please take a look to see if it's OK.

}
}

Object[] res = aggregateEvaluator.result();
Contributor

I think it would be good to check whether the aggregates are non-null and only return if they are valid. Otherwise, this could return different results depending on whether stats are present in the file metadata. To avoid that, we can just detect whether we have a result and abort pushdown if we don't.

Contributor Author

I have added the null check, but after a second thought, I removed it.

If one of the columns has all null values, then its max or min is also null. We probably still want to push down the aggregate.

I have checked the metrics mode to disable push down if the mode doesn't have stats, and I have also disabled push down for complex types. I am wondering if it's safe without the null check here. If not, I will put it back.

Contributor

@huaxingao, I think the fix is to have a flag in the aggregator that can return whether or not the value is valid. That's what I wanted to use null for here, but you're right that there are cases where the aggregate is valid and its value is null because there are no non-null values.

If we keep track of isValid in each aggregator, then the AggregateEvaluator can have a similar method to return whether all aggregates are valid. Then we would just abort the aggregation if any value is not known. We can also have an override flag for when you want the closest answer, even if it isn't guaranteed to be correct.

FYI @aokolnychyi
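A sketch of that shape (the interface and class names below are hypothetical, not the final Iceberg API): the evaluator only reports a result when every aggregator is still valid.

```java
// Hypothetical sketch: names do not match the final Iceberg implementation.
interface Aggregator<T> {
  boolean isValid();
  T result();
}

class FixedAggregator implements Aggregator<Integer> {
  private final boolean valid;
  private final Integer value;

  FixedAggregator(boolean valid, Integer value) {
    this.valid = valid;
    this.value = value;
  }

  public boolean isValid() { return valid; }
  public Integer result() { return value; }
}

class Evaluator {
  private final java.util.List<? extends Aggregator<?>> aggregators;

  Evaluator(java.util.List<? extends Aggregator<?>> aggregators) {
    this.aggregators = aggregators;
  }

  // Abort pushdown as a whole if any single aggregate is not known.
  boolean allAggregatorsValid() {
    return aggregators.stream().allMatch(Aggregator::isValid);
  }
}
```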

Contributor Author

OK thanks

return true;
}

private Snapshot getSnapshot() {
Contributor

Iceberg does not use get in method names. There's probably a better name here, like readSnapshot().

Contributor Author

Fixed, thanks

// maybe changed, so disable push down aggregate.
if (Integer.parseInt(map.getOrDefault("total-position-deletes", "0")) > 0
|| Integer.parseInt(map.getOrDefault("total-equality-deletes", "0")) > 0) {
LOG.info("Cannot push down aggregates when row level deletes exist.");
Contributor

Log messages should be more direct. In this case, the main information is that the aggregate pushdown is skipped. The reason why is secondary, but important. Rather than making this a statement that needs to be interpreted ("X is not possible" -> "Iceberg didn't do X"), this should be "Skipped aggregate pushdown: detected row level deletes".

In addition, I think that there are cases where you'd still want an answer from metadata. First, there may not be any matching delete files, so it could be safe. Second, it may be better to get an approximate answer. I think this should be handled using another setting that enables/disables aggregate pushdown when deletes are present. And for the first case, pushdown should only fail if there were delete files returned for at least one FileScanTask (as returned by planFiles called above).
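A standalone sketch of that per-task check (ScanTask below is a stand-in for Iceberg's FileScanTask, and the delete entries are just strings for illustration):

```java
// Illustrative stand-ins only: the real code would inspect
// FileScanTask.deletes() for each planned task.
class ScanTask {
  final java.util.List<String> deletes;

  ScanTask(java.util.List<String> deletes) {
    this.deletes = deletes;
  }
}

class DeleteCheck {
  // Pushdown only fails if at least one planned task carries delete files.
  static boolean canPushDownAggregates(java.util.List<ScanTask> tasks) {
    for (ScanTask task : tasks) {
      if (!task.deletes.isEmpty()) {
        return false; // matching deletes may remove rows; stats can't be trusted
      }
    }
    return true;
  }
}
```

This way a table whose snapshot summary reports deletes can still use pushdown when none of the delete files actually match the scanned tasks.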

Contributor

I also support the idea of checking if any matching tasks have deletes and using that instead of relying on generic snapshot metadata.

Contributor Author

I have changed the code to check the deletes in the tasks and abort the push down if deletes are present.

I also agree it may be better to introduce another setting to get an approximate number if there are deletes. Probably we can do this in a follow up PR.

// be used to calculate min/max/count, will enable aggregate push down in next phase.
// TODO: enable aggregate push down for partition col group by expression
if (aggregation.groupByExpressions().length > 0) {
LOG.info("Group by aggregation push down is not supported yet.");
Contributor

Error message: don't use "yet". It is simply not supported.

It should be possible to do this, but I understand skipping it in the first PR.

Contributor Author

Fixed. thanks

@aokolnychyi
Contributor

I'd love to take a look at this PR on Monday too.

private final List<BoundAggregate<?, ?>> aggregates;

private AggregateEvaluator(List<BoundAggregate<?, ?>> aggregates) {
ImmutableList.Builder<BoundAggregate.Aggregator<?>> aggregatorsBuilder =
Contributor

nit: What about a direct import for Aggregator to shorten the lines?

ImmutableList.Builder<Aggregator<?>> aggregatorsBuilder = ImmutableList.builder();


AggregateEvaluator aggregateEvaluator;
try {
List<BoundAggregate<?, ?>> aggregates =
Contributor

@aokolnychyi aokolnychyi Feb 4, 2023

I'd consider adding another method to SparkAggregates to convert an entire Aggregation. That way, we will be able to simplify this block.
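A simplified sketch of what such a helper could look like (the string-based conversion below is purely illustrative; the real method would operate on Spark's Aggregation and Iceberg expression types): convert every aggregate in one call, failing as a whole when any single function is unsupported.

```java
// Purely illustrative: names and types do not match the real SparkAggregates.
class AggConverter {
  static java.util.Optional<java.util.List<String>> convertAll(java.util.List<String> functions) {
    java.util.List<String> converted = new java.util.ArrayList<>();
    for (String fn : functions) {
      switch (fn) {
        case "min":
        case "max":
        case "count":
          converted.add("iceberg:" + fn);
          break;
        default:
          return java.util.Optional.empty(); // e.g. avg: abort the whole conversion
      }
    }
    return java.util.Optional.of(converted);
  }
}
```

The caller then has a single call site and a single fallback path instead of per-expression conversion and error handling.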

Contributor Author

Changed. Thanks

@aokolnychyi
Contributor

Great work, @huaxingao! I am looking forward to this being merged.

@github-actions github-actions bot added the core label Feb 4, 2023
@huaxingao
Contributor Author

@aokolnychyi @rdblue Thank you very much for your review! I have addressed most of the comments. Will finish the rest at a later time.

}

@Test
public void testAggregateNotPushDownForStringType() {
Contributor

I think that this test depends on the default metrics mode for string, not on the type being a string itself. If the metrics mode were 'full' so that values aren't truncated, then it would work. You may want to set the metrics mode explicitly for this test and test the case where the metrics are not truncated.

Contributor Author

I actually set the metrics mode to full at the end of the test and tested push down:

tableName, TableProperties.DEFAULT_WRITE_METRICS_MODE, "full");

Contributor

I'm glad to hear that there's a test for full, but the correctness of this test shouldn't rely on the default metrics mode. I think it should explicitly set the string column's mode to truncate.

List<Object[]> explain = sql("EXPLAIN " + select, tableName);
String explainString = explain.get(0)[0].toString();
boolean explainContainsPushDownAggregates = false;
if (explainString.contains("count(*)".toLowerCase(Locale.ROOT))
Contributor

It's the explain string that needs to be lower cased in these tests as well.
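The point, as a one-line illustration (the explain output strings below are fabricated): "count(*)" is already lower case, so lower-casing the search term is a no-op; it is the EXPLAIN output whose case can vary.

```java
// Lower-case the haystack, not the needle.
class ExplainCheck {
  static boolean containsPushedCount(String explainString) {
    return explainString.toLowerCase(java.util.Locale.ROOT).contains("count(*)");
  }
}
```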

Contributor

@rdblue rdblue left a comment

There are a couple of minor issues with the use of string contains in the tests, but overall I think this is correct and almost ready to go in. Thanks, @huaxingao for getting this done!

@rdblue rdblue merged commit 0797b89 into apache:master Feb 23, 2023
@rdblue
Contributor

rdblue commented Feb 23, 2023

@huaxingao, I merged this PR, since we can fix some of the minor issues in follow ups and I want to make sure this is in the next release. Thanks!

@huaxingao
Contributor Author

@rdblue Thank you very much for helping me on this PR! Really appreciate all your help!
Also thanks @aokolnychyi @singhpk234 @amogh-jahagirdar for reviewing the PR!

@huaxingao huaxingao deleted the agg_push_down2 branch February 23, 2023 20:22
@huaxingao
Contributor Author

@rdblue Here is the followup.

Contributor

@aokolnychyi aokolnychyi left a comment

Great work, @huaxingao! I had a few follow-up comments. Sorry it took me so long to get back to the PR.

*
* @return a new scan based on this with column stats
*/
default TableScan withColStats() {
Contributor

Why do we have to add this if we already have includeColumnStats defined in Scan?
I think we should be able to use that.

Contributor Author

Right. Sorry I didn't notice there is already an existing method.

Contributor

No problem at all. I forgot about that one initially too.
